value function estimate
Belief-DependentMacro-ActionDiscovery inPOMDPsusingtheValueofInformation
This property can be observed directly from Eq. 2 when integration is replaced by summation. Closed-loop First, we construct the standard, closed-loopα-vectors, which represent the value function under closed loop dynamics [1,5]. Each point in the scatter plot represents a paired experiment with identical target dynamics.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Illinois (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Japan > Honshū > Chūgoku > Hiroshima Prefecture > Hiroshima (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Illinois (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Japan > Honshū > Chūgoku > Hiroshima Prefecture > Hiroshima (0.04)
- Information Technology > Game Theory (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Robots (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.67)
Wasserstein Adaptive Value Estimation for Actor-Critic Reinforcement Learning
Baheri, Ali, Sharooei, Zahra, Salgarkar, Chirayu
We present Wasserstein Adaptive Value Estimation for Actor-Critic (WAVE), an approach to enhance stability in deep reinforcement learning through adaptive Wasserstein regularization. Our method addresses the inherent instability of actor-critic algorithms by incorporating an adaptively weighted Wasserstein regularization term into the critic's loss function. We prove that WAVE achieves $\mathcal{O}\left(\frac{1}{k}\right)$ convergence rate for the critic's mean squared error and provide theoretical guarantees for stability through Wasserstein-based regularization. Using the Sinkhorn approximation for computational efficiency, our approach automatically adjusts the regularization based on the agent's performance. Theoretical analysis and experimental results demonstrate that WAVE achieves superior performance compared to standard actor-critic methods.
- North America > United States > New York > Monroe County > Rochester (0.04)
- Asia > Middle East > Jordan (0.04)
Simulation-Based Optimistic Policy Iteration For Multi-Agent MDPs with Kullback-Leibler Control Cost
Nakhleh, Khaled, Eksin, Ceyhun, Ekin, Sabit
This paper proposes an agent-based optimistic policy iteration (OPI) scheme for learning stationary optimal stochastic policies in multi-agent Markov Decision Processes (MDPs), in which agents incur a Kullback-Leibler (KL) divergence cost for their control efforts and an additional cost for the joint state. The proposed scheme consists of a greedy policy improvement step followed by an m-step temporal difference (TD) policy evaluation step. We use the separable structure of the instantaneous cost to show that the policy improvement step follows a Boltzmann distribution that depends on the current value function estimate and the uncontrolled transition probabilities. This allows agents to compute the improved joint policy independently. We show that both the synchronous (entire state space evaluation) and asynchronous (a uniformly sampled set of substates) versions of the OPI scheme with finite policy evaluation rollout converge to the optimal value function and an optimal joint policy asymptotically.
- North America > United States > Texas > Brazos County > College Station (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)